Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally disable logging in the data sampler to support predict_step #10127

Merged
merged 11 commits into from
Aug 21, 2024

Conversation

jstjohn
Copy link
Collaborator

@jstjohn jstjohn commented Aug 13, 2024

Pytorch lightning's predict_step does not support logging on the module. This PR adds an option to the data sampler to disable logging which allows the predict step to work.

@jstjohn jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from baf1c1f to c583c91 Compare August 13, 2024 17:54
raise ValueError("non-None value not found")


def get_dtype_device(torch_object) -> Tuple[torch.dtype, torch.device]: # noqa: D103

Check notice

Code scanning / CodeQL

Explicit returns mixed with implicit (fall through) returns Note test

Mixing implicit and explicit returns may indicate an error as implicit returns always return None.


# NOTE(SKH): These types are all wrong, but are close. The inner type must always be a torch.Tensor, but the outer container should be generic.
def batch_collator(batches: Optional[Union[Tuple[ReductionT], List[ReductionT]]]) -> Optional[ReductionT]:

Check notice

Code scanning / CodeQL

Explicit returns mixed with implicit (fall through) returns Note test

Mixing implicit and explicit returns may indicate an error as implicit returns always return None.
case [list(), *_]:
return [batch_collator([batch[i] for batch in batches]) for i in range(len(batches[0]))]
case None:
return None

Check warning

Code scanning / CodeQL

Unreachable code Warning test

This statement is unreachable.
@jstjohn jstjohn self-assigned this Aug 13, 2024
@jstjohn jstjohn force-pushed the jstjohn/optional_logging_for_predict branch 2 times, most recently from f278078 to 479e2d8 Compare August 14, 2024 19:01
@jstjohn jstjohn added Run CICD and removed Run CICD labels Aug 14, 2024
@ko3n1g ko3n1g removed the Run CICD label Aug 14, 2024
@ko3n1g
Copy link
Collaborator

ko3n1g commented Aug 14, 2024

stopping this for now until our CI is out of maintenance

@jstjohn jstjohn added Run CICD and removed Run CICD labels Aug 15, 2024
@jstjohn jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from ac2075b to b62a392 Compare August 15, 2024 21:04
@jstjohn jstjohn added Run CICD and removed Run CICD labels Aug 15, 2024
@jstjohn jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from eddcbb4 to 6f4e2d9 Compare August 15, 2024 21:21
@jstjohn jstjohn added Run CICD and removed Run CICD labels Aug 15, 2024
@jstjohn jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from 6f4e2d9 to 5766f47 Compare August 16, 2024 16:35
@jstjohn jstjohn added Run CICD and removed Run CICD labels Aug 16, 2024
@jstjohn jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from e2672ee to 88d9df9 Compare August 16, 2024 19:00
@jstjohn jstjohn added Run CICD and removed Run CICD labels Aug 16, 2024
@jstjohn jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from 88d9df9 to 92d6307 Compare August 16, 2024 21:18
@jstjohn jstjohn removed the Run CICD label Aug 16, 2024
@jstjohn jstjohn added Run CICD and removed Run CICD labels Aug 20, 2024
@jstjohn jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from d0258da to 0da36a9 Compare August 20, 2024 21:11
@jstjohn jstjohn added Run CICD and removed Run CICD labels Aug 20, 2024
@ashors1
Copy link
Collaborator

ashors1 commented Aug 21, 2024

Looks good, thanks! Can we clean up some of the old comments in the test script?

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
…s to cwd

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
@jstjohn jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from 236f257 to a6ff157 Compare August 21, 2024 19:35
@jstjohn jstjohn added Run CICD and removed Run CICD labels Aug 21, 2024
Copy link
Collaborator

@ashors1 ashors1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks!

@jstjohn jstjohn merged commit ff7c614 into main Aug 21, 2024
129 of 130 checks passed
@jstjohn jstjohn deleted the jstjohn/optional_logging_for_predict branch August 21, 2024 23:38
WoodieDudy pushed a commit to WoodieDudy/NeMo that referenced this pull request Aug 26, 2024
NVIDIA#10127)

* Resolve merge conflicts with consumed sample logging

Signed-off-by: John St John <jstjohn@nvidia.com>

* Add test file that captures the predict step error

Signed-off-by: John St John <jstjohn@nvidia.com>

* Add fixme comment around proper checkpoint nemo2 handling

Signed-off-by: John St John <jstjohn@nvidia.com>

* Skip megatron training test on CPU nodes

Signed-off-by: John St John <jstjohn@nvidia.com>

* Move output_log to last arg for compatibility

Signed-off-by: John St John <jstjohn@nvidia.com>

* try setting the default root dir in predict to avoid writing artifacts to cwd

Signed-off-by: John St John <jstjohn@nvidia.com>

* Handle the new check for batch samplers to enable predict_step

Signed-off-by: John St John <jstjohn@nvidia.com>

* Only reset the global microbatch, not entire parallel state

Signed-off-by: John St John <jstjohn@nvidia.com>

* Destroy the right sets of state in test of lightning trainer

Signed-off-by: John St John <jstjohn@nvidia.com>

* Fix typo and rename state resetting functions

Signed-off-by: John St John <jstjohn@nvidia.com>

* Run test in a subprocess to avoid contaminating global state

Signed-off-by: John St John <jstjohn@nvidia.com>

---------

Signed-off-by: John St John <jstjohn@nvidia.com>
hemildesai pushed a commit that referenced this pull request Aug 28, 2024
#10127)

* Resolve merge conflicts with consumed sample logging

Signed-off-by: John St John <jstjohn@nvidia.com>

* Add test file that captures the predict step error

Signed-off-by: John St John <jstjohn@nvidia.com>

* Add fixme comment around proper checkpoint nemo2 handling

Signed-off-by: John St John <jstjohn@nvidia.com>

* Skip megatron training test on CPU nodes

Signed-off-by: John St John <jstjohn@nvidia.com>

* Move output_log to last arg for compatibility

Signed-off-by: John St John <jstjohn@nvidia.com>

* try setting the default root dir in predict to avoid writing artifacts to cwd

Signed-off-by: John St John <jstjohn@nvidia.com>

* Handle the new check for batch samplers to enable predict_step

Signed-off-by: John St John <jstjohn@nvidia.com>

* Only reset the global microbatch, not entire parallel state

Signed-off-by: John St John <jstjohn@nvidia.com>

* Destroy the right sets of state in test of lightning trainer

Signed-off-by: John St John <jstjohn@nvidia.com>

* Fix typo and rename state resetting functions

Signed-off-by: John St John <jstjohn@nvidia.com>

* Run test in a subprocess to avoid contaminating global state

Signed-off-by: John St John <jstjohn@nvidia.com>

---------

Signed-off-by: John St John <jstjohn@nvidia.com>
adityavavre pushed a commit to adityavavre/NeMo that referenced this pull request Sep 15, 2024
NVIDIA#10127)

* Resolve merge conflicts with consumed sample logging

Signed-off-by: John St John <jstjohn@nvidia.com>

* Add test file that captures the predict step error

Signed-off-by: John St John <jstjohn@nvidia.com>

* Add fixme comment around proper checkpoint nemo2 handling

Signed-off-by: John St John <jstjohn@nvidia.com>

* Skip megatron training test on CPU nodes

Signed-off-by: John St John <jstjohn@nvidia.com>

* Move output_log to last arg for compatibility

Signed-off-by: John St John <jstjohn@nvidia.com>

* try setting the default root dir in predict to avoid writing artifacts to cwd

Signed-off-by: John St John <jstjohn@nvidia.com>

* Handle the new check for batch samplers to enable predict_step

Signed-off-by: John St John <jstjohn@nvidia.com>

* Only reset the global microbatch, not entire parallel state

Signed-off-by: John St John <jstjohn@nvidia.com>

* Destroy the right sets of state in test of lightning trainer

Signed-off-by: John St John <jstjohn@nvidia.com>

* Fix typo and rename state resetting functions

Signed-off-by: John St John <jstjohn@nvidia.com>

* Run test in a subprocess to avoid contaminating global state

Signed-off-by: John St John <jstjohn@nvidia.com>

---------

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: adityavavre <aditya.vavre@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants